Abstract

Background: Single nucleotide polymorphisms (SNPs), the most abundant variations in a genome, have been 
widely used in various studies. Detection and characterization of citrus haplotype-based expressed sequence tag 
(EST) SNPs will greatly facilitate further utilization of these gene-based resources. 

Results: In this paper, haplotype-based SNPs were mined out of publicly available citrus expressed sequence tags 
(ESTs) from different citrus cultivars (genotypes) individually and collectively for comparison. There were a total of 
567,297 ESTs belonging to 27 cultivars in varying numbers and consequentially yielding different numbers of 
haplotype-based quality SNPs. Sweet orange (SO) had the most (213,830) ESTs, generating 11,182 quality SNPs in 
3,327 out of 4,228 usable contigs. Summed from all the individual mining results, a total of 25,417 quality SNPs
were discovered: 15,010 (59.1%) were transitions (AG and CT), 9,114 (35.9%) were transversions (AC, GT, CG, and
AT), and 1,293 (5.0%) were insertion/deletions (indels). A vast majority of SNP-containing contigs consisted of only 2
haplotypes, as expected, but the percentages of 2-haplotype contigs varied widely in these citrus cultivars. BLAST of
the 25,417 25-mer SNP oligos to the Clementine reference genome scaffolds revealed 2,947 SNPs had "no hits found",
19,943 had 1 unique hit/alignment, 1,571 had one hit and 2+ alignments per hit, and 956 had 2+ hits and 1+ alignment
per hit. Of the total 24,293 scaffold hits, 23,955 (98.6%) were on the main scaffolds 1 to 9, and only 338 were on
87 minor scaffolds. Most alignments had 100% (25/25) or 96% (24/25) nucleotide identities, accounting for 93% of all 
the alignments. Because almost all the nucleotide discrepancies in the 24/25 alignments were at the SNP sites, the
alignments served well as in silico validation of these SNPs, in addition to and consistent with the validation rate
(84%; 81 of 96 SNPs) obtained by sequencing and SNaPshot assays.

Conclusions: High-quality EST-SNPs from different citrus genotypes were detected, and compared to estimate the 
heterozygosity of each genome. All the SNP oligo sequences were aligned with the Clementine citrus genome to 
determine their distribution and uniqueness and for in silico validation, in addition to SNaPshot and sequencing validation 
of selected SNPs. 

Keywords: Haplotype, Heterozygosity, Polymorphism, Transition, Transversion, Insertion/deletion, Non-synonymous, 
Synonymous 



Background 

Single nucleotide polymorphism (SNP) refers to an allelic 
single-base variation between two haplotype sequences in 
an individual or between any paired homologous chromo- 
somes across homogenous members. SNPs are most 
abundant among genomic DNA variations and ubiquitous 
in both functional genes and non-coding regions [1]. 
Because they are conserved during evolution, associated
with genetic traits, and suited for high throughput geno- 
typing, SNPs are a popular and powerful tool for various 
genetics and genomics studies, such as mapping of whole 
genomes, tagging of important traits, comparison of gen- 
ome evolution, classification of diverse clades, and many 
rapidly developing areas such as pharmacogenomics and 
functional proteomics [2-4]. These SNPs from expressed 
sequence tags (ESTs) represent hundreds of thousands of 
functional genes and likely control many genetic traits 
[5-8]. Due to degeneracy of most three-nucleotide genetic 
codons, a SNP in the coding regions may be synonymous 
(sSNP) if it does not result in change of the protein
sequence or non-synonymous (nsSNP) if it does. The 
nsSNPs are usually more biologically relevant because the 
resulting amino acid changes in proteins may change their 
secondary structures and functions and cause phenotypic 
mutations [1,8,9]. 
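
The synonymous/non-synonymous distinction can be made concrete with a small example. The following minimal Python sketch translates a codon before and after a single-base substitution using the standard genetic code. The function name `classify_coding_snp` and the example codons are illustrative assumptions and are not part of the pipeline used in this study.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_coding_snp(codon, position, alt_base):
    """Return 'sSNP' or 'nsSNP' for a substitution at `position` (0-2) of `codon`."""
    if codon[position] == alt_base:
        raise ValueError("alternative base equals the reference base")
    mutated = codon[:position] + alt_base + codon[position + 1:]
    same_protein = CODON_TABLE[codon] == CODON_TABLE[mutated]
    return "sSNP" if same_protein else "nsSNP"

print(classify_coding_snp("GCT", 2, "C"))  # GCT (Ala) -> GCC (Ala): prints sSNP
print(classify_coding_snp("GAT", 2, "A"))  # GAT (Asp) -> GAA (Glu): prints nsSNP
```

Most third-position substitutions behave like the first call because of codon degeneracy, which is why sSNPs are common in coding regions.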

SNP discovery usually is accomplished through com- 
putational alignment of redundant DNA sequences with 
each other or with a high-quality reference genome 
where discrepant nucleotides can be detected and evalu- 
ated. For the redundancy-based computational approach, 
in addition to sequencing errors as a source of false 
SNPs [5,7,10], it may be even more challenging to distin- 
guish real SNPs among allelic sequences from single nu- 
cleotide discrepancies among highly identical paralogous 
sequences [8,11]. Several bioinformatics programs (pipe- 
lines) have been developed for automatic SNP mining, 
using different input data, computational algorithms, 
quality evaluation strategies, and/or output formats. For 
example, the PolyPhred and PolyBayes pipeline typically 
requires sequence trace files or extracted sequences with 
base calling quality values to minimize false SNPs result- 
ing from sequencing errors [12-14]. PolyBayes also in- 
cludes an extra implementation to identify paralogs and 
their derived false SNPs [13]. Others like autoSNP and 
QualitySNP can accept sequences without quality files
for initial redundancy-based detection, and then grade 
SNPs by confidence levels, which are more commonly 
used with public ESTs that usually do not have trace or 
quality files [8,15]. The QualitySNP pipeline implements 
a haplotype reconstruction algorithm and confidence 
scoring approach to detect reliable synonymous and 
non-synonymous SNPs from public ESTs without quality 
files and a reference genome [8]. In other words, it re- 
clusters ESTs in a contig to determine the potential hap- 
lotypes in the contig. Only single discrepant nucleotides 
between any two reconstructed haplotypes would be 
scored a potential SNP. Sequencing differences can also 
result from sequencing errors or alignment of paralogs. 
Only those potential SNPs passing additional confidence 
interrogation are identified as quality SNPs. Reliable 
quality SNPs represent the different alleles (haplotypes) 
of a gene. As opposed to low-confidence and false SNPs, 
the use of quality SNPs can benefit allele-trait associ- 
ation studies [8]. 
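
The redundancy-based idea described above can be illustrated with a toy example. The sketch below is not the QualitySNP algorithm; it only flags alignment columns in which at least two alleles are each supported by a minimum number of reads, assuming the ESTs of a contig are already aligned to equal length. The function name `candidate_snp_columns` and the `min_reads` threshold are hypothetical.

```python
from collections import Counter

def candidate_snp_columns(aligned_reads, min_reads=2):
    """Return {column: allele counts} where >=2 alleles each have >= min_reads reads."""
    if len({len(read) for read in aligned_reads}) > 1:
        raise ValueError("reads must be aligned to the same length")
    candidates = {}
    for column, bases in enumerate(zip(*aligned_reads)):
        counts = Counter(base for base in bases if base != "N")  # ignore ambiguous calls
        supported = {base: n for base, n in counts.items() if n >= min_reads}
        if len(supported) >= 2:
            candidates[column] = supported
    return candidates

reads = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTGCGTAC",
    "ACGTGCGTAC",
    "ACGTACGTAT",  # singleton mismatch in the last column: treated as a likely error
]
print(candidate_snp_columns(reads))  # prints {4: {'A': 3, 'G': 2}}
```

Requiring each allele to be seen in more than one read is a simple guard against sequencing errors; it does nothing to separate alleles from paralogs, which is the part a haplotype-based method addresses.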

Most citrus species are diploid (2n = 2x = 18), with 
highly heterozygous and relatively small genomes and 
over 30,000 predicted genes [16]. In general, citrus refers 
to true biological species and ancestrally domesticated 
introgressions in Citrus and those in the sexually com- 
patible Fortunella (kumquat) and Poncirus (trifoliate or- 
ange) genera. Citrus fruit types are diverse, and include 
sweet orange (Citrus sinensis), mandarin (C. reticulata),
grapefruit (C. paradisi), lemon (C. limon), lime
(C. aurantifolia), pummelo (C. maxima), and citron
(C. medica). Each type consists of many cultivars
primarily selected from spontaneous bud sports, 
chance seedlings, induced mutants, or conventional 
hybrids. It is widely believed that only C. maxima, 
C. reticulata, and C. medica are true species, although 
the binomial names for the other ancestral hybrid and 
introgression cultivars are widely accepted and used 
[17,18]. These citrus types likely vary in levels of heterozy- 
gosity and share alleles resulting from early introgressions 
across these genomes, according to SSR markers [19-21]. 
A haploid Clementine genome sequence was produced 
using Sanger technology, and one diploid sweet orange 
genome using Roche 454 technology [22], along with many
other citrus genomes using other re-sequencing platforms 
(Gmitter et al. unpublished data). Together with other 
available citrus genomic resources, it is now possible for 
SNP detection and comparison of large-volume citrus 
Sanger EST datasets within and among different citrus 
cultivars. These gene-based SNPs, once available for the 
citrus community, will be very valuable in many genetic 
and genomic studies, and helpful for trait-targeted 
breeding as well [20,21,23]. 

In this paper, SNPs in public ESTs from 27 different 
citrus genotypes were detected by the QualitySNP pipe- 
line and compared to estimate the heterozygosity of each 
genome. All of the short SNP oligo sequences were also 
aligned with the Clementine citrus genome to determine 
their distribution and uniqueness in the genome and for 
in silico validation. Selected SNPs were also validated by 
SNaPshot and sequencing. 

Methods 

Citrus ESTs and cultivars 

All citrus ESTs were retrieved from the National Center 
for Biotechnology Information (NCBI) EST database or
ftp repository if available. There were 27 citrus cultivars 
or biotypes with ESTs (Table 1, Additional file 1). In 
addition to the binomial and common names, the abbre- 
viations for 27 cultivars were designated to facilitate 
presentation (Table 1, Additional file 1); the binomial 
names are those used for the accessions in the NCBI 
database. ESTs were searched for SNPs using the QualitySNP
pipeline [8] in each of the 27 cultivars and in three
cultivar groups, 12 mandarins (M12), 7 limes/lemons/cit- 
ron (L7), and all 27 cultivars (C27). The mining results for 
individual cultivars in the three groups were summed, giving
SM12, SL7, and SC27, which were compared with M12, L7,
and C27, respectively (Additional file 1). 'Ridge Pineapple'
sweet orange (Citrus sinensis) was selected for SNP
validation because the most ESTs and SNPs are from 
sweet orange and it is a parent to several widely used 
mapping populations. 

Table 1 Public ESTs in citrus cultivars/biotypes

| No | Binomial names | Common names | Abbreviations | EST numbers |
|----|----------------|--------------|---------------|-------------|
| 1 | Citrus sinensis | Sweet orange | SO | 213,830 |
| 2 | C. clementina | Clementine mandarin | CM | 122,005 |
| 3 | C. reticulata | Ponkan mandarin | PM | 52,340 |
| 4 | C. unshiu | Satsuma mandarin | SM | 19,072 |
| 5 | C. reshni | Cleopatra mandarin | LM | 5,768 |
| 6 | C. sunki | Hayata mandarin | HM | 5,216 |
| 7 | C. tamurana | Rixiangxia mandarin | RM | 358 |
| 8 | C. hassaku | Hassaku mandarin | KM | 151 |
| 9 | C. natsudaidai | Summer orange | UM | 202 |
| 10 | C. reticulata x C. temple | Orah tangor | OT | 5,823 |
| 11 | C. clementina x C. reticulata | Fortune tangor | FT | 1,917 |
| 12 | C. nobilis x C. kinokuni | Kankitsu Chukanbohon Nou 6 Gou tangor | KT | 645 |
| 13 | C. sinensis x C. reticulata | Amakusa tangor | AT | 160 |
| 14 | C. limonia | Rangpur lime, Mandarin lime | ML | 11,045 |
| 15 | C. latifolia | Tahiti lime | TL | 8,756 |
| 16 | C. aurantifolia | Mexican lime | KL | 8,219 |
| 17 | C. limettioides | Palestine Sweet lime | SL | 8,188 |
| 18 | C. limon | Lisbon lemon | LL | 1,505 |
| 19 | C. jambhiri | Rough lemon | RL | 1,017 |
| 20 | C. medica | Etrog citron | EC | 1,115 |
| 21 | C. aurantium | Sour orange, Bitter orange | BO | 14,584 |
| 22 | C. paradisi | Grapefruit | GF | 8,039 |
| 23 | C. macrophylla | Alemow papeda | AP | 1,929 |
| 24 | C. paradisi x P. trifoliata | Swingle citrumelo | SC | 7,954 |
| 25 | C. sinensis x P. trifoliata | Carrizo citrange | CC | 1,837 |
| 26 | Fortunella margarita | Nagami kumquat | NK | 2,924 |
| 27 | Poncirus trifoliata | Trifoliate orange | TO | 62,695 |
|    | 2-13 combined |  | M12 | 213,660 |
|    | 14-20 combined |  | L7 | 39,845 |
|    | 1-27 combined |  | C27 | 567,297 |

All ESTs were retrieved from the NCBI repository. Entries with over 8,000 ESTs were set in bold in the original table. All mandarin (No. 2-13) and lime/lemon (No. 14-20) types are listed
together. The abbreviation for each cultivar and total was designated to facilitate presentation.



SNP discovery and primer design 

The QualitySNP pipeline was installed and used for SNP 
discovery, following the program manual and recom- 
mended parameters [8]. QualitySNP first identified hap- 
lotypes in a contig by re-clustering its ESTs and 
extracted all nucleotide discrepancies (called potential 
SNPs, pSNPs) between identified haplotypes in a contig, 
from which a subset of so-called quality SNPs (qSNPs) 
was identified based on allele and SNP confidence scores 
defined in the haplotype-based mining algorithm [8]. 
These qSNP-containing contigs and 25-mer oligo se- 
quences, along with much other mining information, 
were saved in separate files for database construction 
and result summary. The ratios of qSNP/pSNP were
calculated to indicate the percentage of nucleotide discrepancies
(pSNPs) identified as high-quality SNPs
(qSNPs) by the QualitySNP algorithm. Bioinformatics 
programs included in the pipeline were cross_match in
the phred-phrap-consed package [24,25] to remove vectors,
CAP3 [26] to assemble ESTs, and FASTY [27] to align
ESTs to the proteins in the UniProt database for identification
of non-synonymous and synonymous SNPs.
BatchPrimer3 [28] was used to design a forward (F), a 
reverse (R), and a single base extension (SBE) primer 
flanking each SNP site. The F, R and SBE primers of 96 
SNPs from SO were selected for both sequencing and 
SBE genotyping validation (Additional file 2). After sort- 
ing by the lengths of SBE primers, except the first, the 
other 7 primers of every 8 SBE primers were tailed in
the 5' end with three groups of non-homologous polynu- 
cleotides of different lengths to facilitate future multiplex 
genotyping application. All the F, R and tailed SBE 
primers, 96 each, were synthesized by Eurofins MWG 
Operon (Huntsville, AL) in a 96-well plate, respectively,
where every three primers of each SNP were placed in
the same well of the three different plates and stored in
ddH2O at 10 µM. The format facilitated easy primer
positioning and channel pipetting during the genotyping
and sequencing preparation. 

SNP 25-nucleotide sequence blast 

All 25-nucleotide oligo sequences (SNP in the middle 
nucleotide) generated from every citrus genotype by 
QualitySNP were combined together and used to align to 
the haploid Clementine reference genome (version 1.0; 
phytozome.org and citrusgenomedb.org) using BLASTN 
[29] and a cut-off e-value of 6e-004 (0.0006). Each query 
sequence (25-mer oligo) against the subject scaffolds 
yielded one of the following BLASTN outputs: "no
hits found", 1 hit on 1 scaffold with 1 alignment, or any 
other cases (i.e., 1 hit on 1 scaffold with 2+ alignments at 
different positions or 2+ hits on different scaffolds with 1+ 
alignment each hit). At the preset e value, only alignments 
with 84% identities and higher (in other words, only 6 
types of alignment hits: 25/25, 24/25, 24/24, 23/23, 22/22, 
and 21/21), were saved in the BLASTN output file. The 
information in the output file, including the scaffold, pos- 
ition, strand, e value, score, alignment identities of each 
hit, and hit status, was parsed into an EXCEL file to 
summarize SNP alignment status and to calculate distri- 
bution on the Clementine reference genome scaffolds. 
The information was also used as additional criteria for 
categorization of SNPs and selection of desired core sets. 
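
The hit categories described above can be derived mechanically from a list of alignments. The following is a minimal sketch, assuming each BLASTN alignment has already been reduced to a (query, scaffold) pair; the function name `classify_hit_status` and the example identifiers are hypothetical and were not used in the original analysis, which summarized the parsed output in a spreadsheet.

```python
from collections import defaultdict

def classify_hit_status(query_ids, alignments):
    """Assign each 25-mer to one of the four hit categories.

    alignments: iterable of (query_id, scaffold_id) pairs, one per alignment.
    """
    per_query = defaultdict(lambda: defaultdict(int))
    for query_id, scaffold_id in alignments:
        per_query[query_id][scaffold_id] += 1  # alignments per scaffold hit

    status = {}
    for query_id in query_ids:
        scaffolds = per_query.get(query_id, {})
        if not scaffolds:
            status[query_id] = "no hits found"
        elif len(scaffolds) >= 2:
            status[query_id] = "2+ hits (1+ aln each hit)"
        elif sum(scaffolds.values()) == 1:
            status[query_id] = "1 hit (1 aln)"
        else:
            status[query_id] = "1 hit (2+ aln)"
    return status

# Hypothetical oligo and scaffold names, for illustration only.
queries = ["snp_a", "snp_b", "snp_c", "snp_d"]
alns = [
    ("snp_b", "scaffold_1"),
    ("snp_c", "scaffold_3"), ("snp_c", "scaffold_3"),
    ("snp_d", "scaffold_1"), ("snp_d", "scaffold_7"),
]
for query, category in classify_hit_status(queries, alns).items():
    print(query, category)
```

Only the "1 hit (1 aln)" category corresponds to oligos that appear unique in the reference genome at the chosen e-value.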

SNP validation by sequencing and SNaPshot 
genotyping assay 

BigDye Terminator V3.1 Cycle Sequencing Kit and 
SNaPshot Multiplex Kit (Applied Biosystems, Foster 
City, CA) were used to validate SNPs, following the 
manufacturer's protocols with some modifications in re- 
action volumes and/or quantity of proprietary reagents. 
96-well plates were used for PCR, enzymatic incubation, 
and denaturation on iCycler (Bio-Rad, Hercules, CA) 
and/or GeneAmp PCR System 9700 (Applied Biosystems,
Foster City, CA), and for genotyping and sequencing
on a 3130xl Genetic Analyzer (Applied Biosystems,
Foster City, CA). Unless otherwise stated, brief centrifugation
up to 1000 rpm in a Jouan MR 23i centrifuge was applied after
addition of a solution or before implementation of new 
steps, and all the PCR and enzymatic incubation pro- 
grams were set to hold at 4°C indefinitely at the end 
until a next procedure. 



For both dye terminator sequencing and SNaPshot assays
to validate SNPs, template preparation was carried
out in 10 µl in each well consisting of 3.3 µl ddH2O,
1.0 µl 10x dNTPs (2 mM), 2.0 µl 5x colorless GoTaq
Flexi buffer, 0.8 µl 25 mM MgCl2, 0.4 µl F and R primers
each, 0.1 µl GoTaq Flexi (5 units per µl; Promega, Madison,
WI), and 2 µl genomic DNA (10 ng/µl). The touch-down
PCR program started from an initial denaturation at 94°C
for 3 min, followed by 10 cycles of 93°C for 30 sec, 56°C
for 45 sec (decreasing 0.5°C each annealing step), 72°C for
45 sec, and 30 continuing cycles with 51°C at the annealing
step, plus a final elongation at 72°C for 15 min. Removal
of primers and unused dNTPs was performed by
addition of 1 µl of ExoSAP-IT (Affymetrix, Santa Clara,
CA) into each well of the plate, and incubation at 37°C for
60 min and 75°C for 15 min.
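
As a quick consistency check of the recipe and cycling program above, the sketch below sums the component volumes and lists the annealing temperature of each cycle. It assumes the first touch-down cycle anneals at 56.0°C and that each later touch-down cycle is 0.5°C lower; the function name `touchdown_annealing_temps` is illustrative only.

```python
# Template-preparation PCR mix (volumes in microliters per 10-microliter reaction).
pcr_mix = {
    "ddH2O": 3.3,
    "10x dNTPs (2 mM)": 1.0,
    "5x colorless GoTaq Flexi buffer": 2.0,
    "25 mM MgCl2": 0.8,
    "F primer": 0.4,
    "R primer": 0.4,
    "GoTaq Flexi": 0.1,
    "genomic DNA": 2.0,
}
total_volume = round(sum(pcr_mix.values()), 2)  # rounded to avoid float noise

def touchdown_annealing_temps(start=56.0, step=0.5, touchdown_cycles=10,
                              final_temp=51.0, final_cycles=30):
    """Annealing temperature (deg C) for every cycle of the touch-down program."""
    touchdown = [start - step * i for i in range(touchdown_cycles)]
    return touchdown + [final_temp] * final_cycles

temps = touchdown_annealing_temps()
print(total_volume)           # prints 10.0
print(temps[:10])             # 56.0 down to 51.5 in 0.5 steps
print(len(temps), temps[-1])  # prints 40 51.0
```

Under this reading, the last touch-down cycle anneals at 51.5°C, one step above the 51°C used for the remaining 30 cycles.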

Sequencing reactions for SNP validation were prepared
in 10 µl in each well of a new plate including 2 µl
5x sequencing buffer, 2 µl ready reaction premix in the
sequencing kit, 1 µl 10 µM SNP F primer, and 5 µl
ExoSAP-IT treated PCR product, started at 95°C for
1 min, followed by 25 thermal cycles of 95°C for 10 sec,
50°C for 5 sec, and 60°C for 4 min. Following the manufacturer's
instructions, ethanol/EDTA/sodium acetate
precipitation was used to purify the sequencing product
in the plate, which was subsequently air dried, then
mixed with 2 µl ddH2O and 6 µl Hi-Di formamide in
each well, denatured, and loaded to the genetic analyzer
to sequence. The sequence files generated were analyzed
by Sequencing Analysis software (Applied Biosystems,
Foster City, CA) to generate sequences and electropherograms,
in which a validated SNP was confirmed by
correct alignment of SBE primer sequence into the corresponding
sequences and visualization of two different
overlapped nucleotide peaks at the nucleotide site in the
electropherograms.

The SBE reaction for SNaPshot assays was prepared in
5 µl in each well in a new plate including 0.5 µl ready reaction
premix in the SNaPshot kit, 1 µl 10 µM SBE primer,
and 3.5 µl ExoSAP-IT treated PCR product, and
repeated in 25 thermal cycles of 95°C for 10 sec, 50°C
for 5 sec, and 60°C for 30 sec. Removal of unincorporated
dye-labeled ddNTPs was completed by addition of
5 µl SAP mix (3.5 µl ddH2O, 1.0 µl 10x SAP buffer, and
0.5 µl 1 U/µl SAP) into the SBE reaction mix, and incubation
at 37°C for 60 min and 75°C for 15 min. Genotyping
was performed using 8 µl mix in each well of a new plate
consisting of 1 µl SAP treated SBE product, 0.25 µl GeneScan
120 LIZ size standard, and 6.75 µl Hi-Di formamide,
which was denatured at 95°C for 3 min then immediately
moved on ice for at least 2 min. The SNaPshot files were
used to score SNPs by GeneMarker (SoftGenetics, State
College, PA) in which a validated SNP consisted of two
different nucleotides.

Results

Haplotype-based EST-SNPs in citrus cultivars 

Haplotype-based SNPs were mined from ESTs of the 27 
citrus cultivars and 3 groups (M12 - 12 mandarins, L7 - 
7 limes/lemons, and C27 - all 27 combined) using the 
QualitySNP pipeline and summarized in detail (Additional 
file 1). In summary (SC27 - the last column in Additional 
file 1), a total of 25,417 qSNPs (Additional file 2) were 
identified from ESTs of the 27 cultivars mined separately. 
These are attributed to heterozygosity within cultivars at 
SNP loci. Only 2,805 SNPs were duplicated according
to comparison of all the 25-mer oligo sequences. The
percentages of the 7 SNP types were similar among most 
citrus cultivars with each type of quality SNPs found. 
Among the 25,417 qSNPs summed from the 27 citrus cul- 
tivars, 15,010 (59.1%) were transitions (AG and CT), 9,114 
(35.9%) transversions (AC, GT, CG, and AT), and 1,293 
(5.0%) insertion/deletion events (indels). On average, there 
were 2.4 SNPs per contig and one SNP every 1,064 bp in 
all of the SNP-containing contig sequences (Figure 1; 
Additional file 1). 
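
The grouping of the 7 SNP types into transitions, transversions, and indels follows directly from the allele pair. The sketch below is a minimal illustration; the function name `snp_class` and the use of "-" to mark a gap allele are assumptions, not the pipeline's output format. The counts are those reported above for the 27 individually mined cultivars.

```python
TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def snp_class(allele_a, allele_b):
    """Classify a biallelic SNP as transition, transversion, or indel ('-' marks a gap)."""
    if allele_a == allele_b:
        raise ValueError("alleles must differ")
    if "-" in (allele_a, allele_b):
        return "indel"
    pair = frozenset((allele_a, allele_b))
    return "transition" if pair in TRANSITIONS else "transversion"

# Counts reported for the summed results of the 27 cultivars (SC27).
counts = {"transition": 15010, "transversion": 9114, "indel": 1293}
total_qsnps = sum(counts.values())

print(snp_class("A", "G"), snp_class("C", "G"), snp_class("T", "-"))
print(total_qsnps)  # prints 25417
```

Purine-purine (AG) and pyrimidine-pyrimidine (CT) changes are the only two transition types, so the remaining four base pairs are transversions.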

For individual cultivars, their numbers of ESTs were 
different, so consequentially were their quality SNPs and 
other related numbers. For example, in SO, 213,830 
ESTs yielded 7,404 contigs of >=4 ESTs. Of these, 4,228 
contigs contained 43,655 potential SNPs and 3,327 con- 
tained qSNPs. The total number of qSNPs was 11,182. 
In other words, only one haplotype was detected
in 3,176 contigs (7,404 minus 4,228), and no quality SNP
was identified in an additional 901 contigs (4,228 minus
3,327) with potential SNPs. There were 3.4 quality SNPs
per contig and one quality SNP per 723 bp in the contigs
on average. Of these 11,182 qSNPs, 6,822 (61.0%)
were transitions (AG and CT type), 3,879 (34.7%) transversions
(AC, GT, CG, and AT type), and 481 (4.3%)
insertions/deletions (indels); and 2,619 (23.4%) were nsSNPs
and 4,038 (36.1%) were sSNPs. The absolute numbers of
quality SNPs were not comparable due to varying num- 
bers of ESTs among citrus cultivars, but the number of 
potential and quality SNPs from each cultivar were 
strongly correlated with its number of ESTs; more ESTs 
yielded more usable contigs (>=4 ESTs) available for 
SNP mining, as well as more quality SNPs (Additional 
file 1). Given the large differences in the numbers of 
ESTs available among the various cultivars, it is more in- 
teresting to compare SNP frequencies, rates, and ratios 
among cultivars with substantial EST numbers and dis- 
tinct genetic backgrounds, and differences between the 
mining results of the three grouped ESTs (M12, L7, and 
C27) and the three sums/averages (SM12, SL7, and SC27) 
of separately mined counterpart individuals. These com- 
parisons will be elaborated hereafter. 

Haplotypes detected in contigs with SNPs 

One important feature of QualitySNP is to re-cluster 
ESTs in a contig to reconstruct and determine the hap- 
lotypes in that contig, from which only single nucleotide 
discrepancies between any two defined haplotypes (al- 
lelic sequences) are considered as potential SNPs for fur- 
ther quality and confidence interrogation. Only those 
potential SNPs passing confidence scores are identified 
as quality SNPs. In Additional file 1, all the haplotypes
detected in the SNP-containing contigs from all the 27
citrus cultivars are included. Theoretically, there should
be a maximum of 2 haplotypes detected in a diploid
genome. As expected, a vast majority of SNP-containing
contigs consisted of two haplotypes, but the percentages
of 2-haplotype contigs varied over a wide range in these citrus
cultivars (Figure 2, Additional file 1). Among the highest
were ML (92%), SC (84%), and GF (76%), and among the
lowest PM (38%), KL (42%), and CM (48%). The variation
likely results from the genetic makeup of the "cultivar"
used to generate the ESTs. For example, ESTs for
SO came from navel oranges, blood oranges, and others
named C. sinensis, rather than a single genotype. In contrast,
other "cultivars" are likely single clones. It was also
evident, as expected, that much lower percentages of 2
haplotypes were found in the three combined EST datasets
(M12, 44%; L7, 70%; and C27, 34%) due to introduction
of more haplotypes from different types of citrus cultivars,
compared to their counterpart averages of each
group (SM12, 48%; SL7, 74%; and SC27, 53%). As a consequence,
more qSNPs, with higher qSNPs/pSNPs and
qSNPs/ESTs ratios, were found in the three grouped EST
datasets (M12, L7, and C27), compared to their counterparts
(SM12, SL7, and SC27) summed from the individually
mined cultivar EST results, but the ratio of
contigs with qSNPs to contigs used was the opposite
(Figure 3, Additional file 1). The frequency of qSNPs is
much higher in the pooled data for the three groups
(M12, L7 and C27) than in the summed data for individual
cultivars. This is because the group values include
polymorphism among homozygous accessions as well as
heterozygosity within cultivars, while the summed data
include only SNPs due to heterozygosity. In other words,
a SNP found only in the pooled data is very likely homozygous
within each genotype, making it useless for genetic linkage
mapping of that genotype.

Figure 1 Percentages of the 7 SNP types, AG, CT, AC, GT, CG, AT, and indel, discovered from citrus ESTs. Presented here are 9 selected
citrus cultivars, 3 groups, and 3 sums. SO, Sweet orange; CM, Clementine mandarin; PM, Ponkan mandarin; SM, Satsuma mandarin; ML, Rangpur
lime; BO, Sour orange; GF, Grapefruit; NK, Nagami kumquat; TO, Trifoliate orange; M12, SNPs from ESTs combined from 12 mandarins (2-13 in
Table 1); L7, SNPs from ESTs combined from 7 limes/lemons (14-20 in Table 1); C27, SNPs from all ESTs combined (1-27 in Table 1); SM12, SL7
and SC27, the respective sum of the 12 mandarins, 7 limes/lemons, and all 27 cultivars. Averaged over the 27 cultivars (SC27), transitions
(AG and CT) account for 59.1%, transversions (AC, GT, CG, and AT) for 35.9%, and insertion/deletions (indels) for 5.0%.

Alignment and distribution on the Clementine 
reference genome 

A total of 25,417 25-mer sequences (query sequence, 
Additional file 2) with quality SNPs from all the 27 cit- 
rus cultivars were used to align to the Clementine refer- 
ence scaffolds (subject sequence) using BLASTN at a 
cut-off e-value of 6e-004 (Table 2). 2,947 sequences had 
"no hits found" and 22,470 one or more hits. Of the 
22,470 SNPs with hits, 19,943 had only 1 scaffold hit 
with only 1 alignment on the scaffold, 1,571 had 1 scaf- 
fold hit but >=2 alignments on the scaffold (3 alignments 
per scaffold hit on average), and 956 had>=2 scaffold 
hits (~3 hits per oligo on average) with 1 or more align- 
ments on each of the scaffolds (~7 alignments per scaf- 
fold hit or ~20 alignments per oligo on average). It 
suggested the 19,943 25-mer oligo sequences appear to 
be unique in the genome, and the remaining 2,527 25- 
mer sequences may have duplicated or similar sequences 
with at least 84% identities at different locations in the 
genome. There was one extreme case that one 25-mer 
sequence from trifoliate orange yielded 29 scaffold hits 
and 2,162 alignments on all the scaffolds, the highest 
numbers of all. 
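
The per-hit and per-oligo averages quoted above follow from the counts in Table 2. The sketch below recomputes them; the dictionary layout and the helper name `per` are illustrative choices, and the numbers are copied from Table 2.

```python
# Counts taken from Table 2: 25-mers (oligos), scaffold hits, and alignments.
table2 = {
    "1 hit (1 aln)": {"oligos": 19943, "hits": 19943, "alns": 19943},
    "1 hit (2+ aln)": {"oligos": 1571, "hits": 1571, "alns": 4614},
    "2+ hits": {"oligos": 956, "hits": 2779, "alns": 19111},
}

def per(numerator, denominator):
    """Average rounded to one decimal place."""
    return round(numerator / denominator, 1)

single = table2["1 hit (2+ aln)"]
multi = table2["2+ hits"]
alns_per_hit_single = per(single["alns"], single["hits"])   # about 3
hits_per_oligo_multi = per(multi["hits"], multi["oligos"])  # about 3
alns_per_hit_multi = per(multi["alns"], multi["hits"])      # about 7
alns_per_oligo_multi = per(multi["alns"], multi["oligos"])  # about 20

oligos_with_hits = sum(row["oligos"] for row in table2.values())
non_unique_oligos = single["oligos"] + multi["oligos"]
print(oligos_with_hits, non_unique_oligos)  # prints 22470 2527
```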

[Figure 2 plot: x-axis, selected 9 citrus varieties, 3 groups, and 3 sums (SO, CM, PM, SM, ML, BO, GF, NK, TO, M12, L7, C27, SM12, SL7, SC27); legend, number of haplotypes per contig]

Figure 2 Percentages of detected haplotype numbers (2, 3, 4, and >=5) in contigs (>=4 ESTs) with potential SNPs. Presented here are 9
selected citrus cultivars, 3 groups, and 3 sums. SO, Sweet orange; CM, Clementine mandarin; PM, Ponkan mandarin; SM, Satsuma mandarin; ML,
Rangpur lime; BO, Sour orange; GF, Grapefruit; NK, Nagami kumquat; TO, Trifoliate orange; M12, SNPs from ESTs combined from 12 mandarins
(2-13 in Table 1); L7, SNPs from ESTs combined from 7 limes/lemons (14-20 in Table 1); C27, SNPs from all ESTs combined (1-27 in Table 1);
SM12, SL7 and SC27, the respective sum of the 12 mandarins, 7 limes/lemons, and all 27 cultivars.

Chen and Gmitter BMC Genomics 2013, 14:746 Page 7 of 11
http://www.biomedcentral.com/1471-2164/14/746

[Figure 3 plot: y-axis, percentage; x-axis, qSNPs/pSNPs, qSNPs/ESTs, and contigs with qSNPs/contigs used; series, M12, SM12, L7, SL7, C27, SC27]

Figure 3 Comparisons between M12 vs. SM12, L7 vs. SL7, and C27 vs. SC27, respectively, in three ratios. The three ratios are presented
as percentages. qSNPs, the number of quality SNPs; pSNPs, the number of potential SNPs; ESTs, the number of ESTs; contigs qSNPs, the number of
contigs with qSNPs; contigs used, the number of contigs with >=4 ESTs. M12, L7 and C27 are mined from grouped ESTs from the corresponding
cultivars, and SM12, SL7, and SC27 are summed from individually mined cultivars used in the grouped counterparts, respectively.

Taking these multiple scaffold hits and alignments into
account, the total number of scaffold hits was 24,293
with a total of 43,668 alignments on the scaffolds. Most
alignments had 100% (25/25) or 96% (24/25) nucleotide identities to
those on the reference genome, accounting for 93% of
all the alignments. Almost all the nucleotide discrepancies
in the 24/25 alignments were at the SNP sites,
which is an encouraging in silico validation of these
SNPs. Of the total 24,293 scaffold hits, 23,955 were on
main scaffolds 1 to 9 (2,122, 2,804, 4,159, 2,813, 3,045,
2,501, 1,861, 2,308, and 2,342, respectively), accounting
for 98.6% of the total. The remaining 338 were on 87
small scaffolds. Figure 4 shows the distribution of SNPs
with all and unique hits from SO, TO, and CM on scaffold_1
of the haploid Clementine genome (similar figures
on scaffold_2 are in Additional file 3). According to the
aligned SNP counts in each 500 kb, there were some featured
regions (intervals in Figure 4). For example, in SO
many fewer unique hits were found in the middle region,
compared to those in the two arm regions. Relatively even
distribution was observed in CM, with exceptions at Interval
5 with overwhelming duplicated hits of certain SNPs
(similar to the same region in SO). There were very limited
unique SNPs aligned at Intervals 20-27 of all the three
cultivars, suggesting the region may contain the centromere,
usually characterized by fewer genes. These results,
combined with other criteria, should greatly facilitate
selection of well-distributed core sets of SNPs across citrus
genomes for different genotyping applications and genetic
studies.
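
The per-interval SNP counts described above amount to binning alignment positions into 500 kb windows. The sketch below shows one way to do this, assuming 1-based base-pair positions and intervals numbered from 1; the function name `count_per_interval` and the example positions are hypothetical and do not come from the study's data.

```python
from collections import Counter

BIN_SIZE = 500_000  # 500 kb per interval

def count_per_interval(positions, bin_size=BIN_SIZE):
    """Count aligned SNP positions (1-based bp) per interval; interval 1 covers bp 1-500,000."""
    return Counter((pos - 1) // bin_size + 1 for pos in positions)

# Hypothetical alignment positions on one scaffold, for illustration only.
example_positions = [12_345, 499_999, 500_000, 500_001, 2_750_000]
print(sorted(count_per_interval(example_positions).items()))
# prints [(1, 3), (2, 1), (6, 1)]
```

Running the same binning once on all alignments and once on the uniquely aligned SNPs gives the two series whose differences mark regions rich in duplicated hits.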

SNP validation by sequencing and SNaPshot 
genotyping assay 

Of the 96 randomly selected sweet orange SNPs, 68 were 
validated by sequencing and 74 by SNaPshot in sweet 
orange (Additional file 4). There were 61 validated by 
both assays and the remainder validated by only one 
assay. In other words, 7 were validated by only sequen- 
cing but failed in SNaPshot, and 13 by only SNaPshot 
but failed in sequencing. Therefore, a total of 81 SNPs 
(84%) were validated by at least one of the two assays. 
The high rate (84%) of validated SNPs was consistent 
with 93% alignments onto the reference genome with 
100% (25/25) or 96% (24/25) identities (Table 2), indicating 



Table 2 BLASTN results of 25,417 25-mer oligo sequences 





25-mers 


Hits 


Alns 


25/25 


24/25 


24/24 


23/23 


22/22 


21/21 


No hits found 


2,947 


















1 hit (1 aln) 


1 9,943 


1 9,943 


1 9,943 


1 0,926 


8,555 


127 


116 


112 


107 


1 hit (2+ aln) 


1,571 


1,571 


4,614 


2,152 


2,026 


161 


73 


78 


124 


2+ hits (1+ aln each hit) 


956 


2,779 


19,111 


7,923 


9,014 


389 


353 


715 


/I/ 


Total 


25,417 


24,293 


43,668 


21,001 


19,595 


677 


542 


905 


948 



The Clementine reference genome was used as the BLAST database. All the oligo sequences were listed in Additional file 2. Aln - Alignment(s); 1 hit (1 aln) - hit 
only 1 scaffold with 1 alignment; 1 hit (2+ aln) - hit on only 1 scaffold but with 2 or more alignments, and 2+ hits (1 +aln) - hit on 2 and more scaffolds with one 
or more alignment to each scaffold. 
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Figure 4 SNP distribution on the Clementine reference genome, using Scaffold_1 as an example. Each interval of the x-axis represented 
500 kb of the scaffold, and the y-axis represented the number of SNPs in each 500 kb on the scaffold. SO - sweet orange (A); TO - trifoliate or- 
ange (B); CM - Clementine mandarin (C); "_a" - counts of all alignments generated by all SNPs; "_]" - counts of SNPs of only 1 unique hit/align- 
ment in the genome. Differences between the "_a" and numbers are observed in several regions of each cultivar. 



that QualitySNP, a haplotype-based SNP mining algorithm 
and pipeline, is a very reliable tool to identify true EST 
SNPs, and it can effectively minimize the false discovery 
rate even without quality files. 
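
As a quick arithmetic check, the 93% figure can be re-derived from the
"Total" row of Table 2. The sketch below only re-adds the published
alignment counts; the variable names are illustrative and are not part
of the QualitySNP pipeline.

```python
# Alignment counts by identity class, copied from the "Total" row of Table 2.
alignments_by_identity = {
    "25/25": 21_001,
    "24/25": 19_595,
    "24/24": 677,
    "23/23": 542,
    "22/22": 905,
    "21/21": 948,
}

# All alignments produced by the 25-mer SNP oligos.
total_alignments = sum(alignments_by_identity.values())

# Alignments at the two identity classes treated as in silico support.
high_identity = alignments_by_identity["25/25"] + alignments_by_identity["24/25"]
share_high_identity = high_identity / total_alignments

print(f"total alignments: {total_alignments}")
print(f"25/25 or 24/25:   {high_identity} ({share_high_identity:.1%})")
```

The share is computed over alignments, not over SNPs, because one
25-mer can produce several alignments.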

Discussion 

Estimation of heterozygosity of different citrus genomes by haplotype-based SNPs

Many naturally evolved genomes are heterozygous, and
the heterozygosity level may be evaluated by the rate of
allelic nucleotide variations between the two haplotypes
[30]. SNPs, the most abundant polymorphisms in genomes,
are likely the most appropriate index of the
heterozygosity levels of genetically/taxonomically related
genomes [19,21,22]. Given the different numbers and
rates of haplotype-based SNPs discovered from the citrus
individuals with substantial numbers of ESTs (for example,
more than 5,000; Additional file 1), the ratios of
qSNPs/ESTs in most of them appeared to reflect their
heterozygous status and genetic background. The
hybrid derivatives had much higher qSNPs/ESTs ratios,
while the presumed "pure" species had lower ratios.
For example, proven natural hybrid cultivars such
as SO and CM, and recent hybrids such as SC, had among
the higher qSNPs/ESTs ratios (SO - 5.23%, CM - 8.31%,
and SC - 7.76%). Other presumed true species, including
PM, had lower qSNPs/ESTs ratios (PM - 0.60%).
From these ratios, the number of ESTs needed to generate
a desired number of SNPs in a given citrus genotype, and
vice versa, can be estimated. This tendency, along with the
relationship between the ratios and genome heterozygosity,
would be more conclusive if the numbers of ESTs in
all the cultivars were close to each other, or at least
within a much smaller range.
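
The estimate mentioned above can be sketched as a simple proportion.
This is a minimal illustration, not the authors' procedure: it assumes
that the qSNPs/ESTs ratio stays constant as more ESTs are added, which
is a simplification, and the function names are hypothetical. The sweet
orange counts (11,182 qSNPs from 213,830 ESTs) are those reported in
the abstract.

```python
import math


def qsnp_est_ratio(n_qsnps: int, n_ests: int) -> float:
    """Return the number of quality SNPs per EST for one genotype."""
    if n_ests <= 0:
        raise ValueError("n_ests must be positive")
    return n_qsnps / n_ests


def ests_needed(target_qsnps: int, ratio: float) -> int:
    """Estimate ESTs needed for a target qSNP count.

    Assumption (not from the paper): the ratio is constant, i.e. qSNP
    yield scales linearly with the number of ESTs.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return math.ceil(target_qsnps / ratio)


# Sweet orange (SO) counts as reported in the abstract.
so_ratio = qsnp_est_ratio(11_182, 213_830)
print(f"SO qSNPs/ESTs: {so_ratio:.2%}")

# Rough EST requirement for a 1,536-SNP set at the SO ratio.
print(f"ESTs for 1,536 qSNPs: {ests_needed(1_536, so_ratio)}")
```

In practice the yield is unlikely to be strictly linear, since deeper
EST sampling changes contig depth, so the result is best read as an
order-of-magnitude guide.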

SNP discovery and validation rates 

SNP mining is no longer a bottleneck because computational
capacity and sequence data are increasing exponentially,
and more SNP mining pipelines have become
available in recent years [7,8,12-15,31]. Hundreds of
thousands of SNPs can be easily mined out of EST or
genomic sequences. Including false SNPs in genotyping
is wasteful; therefore, maximizing the true
SNP rate (minimizing the false rate) is the most important
requirement for a SNP mining algorithm,
because a validation assay can only confirm
true SNPs, not false ones [8,13]. We found that
93% of the alignments of the SNP oligos identified by the
QualitySNP pipeline onto the reference genome had 25/25
or 24/25 identities, and 81% of randomly selected sweet
orange SNPs were validated by sequencing and SNaPshot
genotyping. It remained undetermined whether the others,
those not aligned at the two identity rates and those not
validated by sequencing and/or genotyping, were true or
false SNPs. For example, failures in sequencing validation
might be due to SBE primer sequences not being found
(likely an intron in the region), sequencing failures
caused by primers of low quality or located in a variable
region, or no nucleotide discrepancies at the sites. It was
unclear why some SNPs failed in SNaPshot validation; we
speculate that some of the SBE primers were incorrectly
positioned, i.e., the singly extended nucleotides may not
have been exactly at the SNP sites. A few such cases
were identified (Chen et al. unpublished data), very
likely due to differences between the consensus
contigs and the original haplotype sequences. On the
other hand, only 2 haplotypes may exist at a locus in a
diploid genome. SNPs from contigs with more than 2
haplotypes could therefore result from either ESTs
mixed from diverse genotypes of the same species or
highly identical paralogs assembled into the same contig.
Paralogous genes, which result from genomic duplication
and evolve different functions, are very common in
many genomes and remain almost identical in their
conserved regions. ESTs from different paralogous genes,
if assembled into the same unigene, could yield false SNPs
that are non-allelic and useless.

Criteria for selection of citrus core SNP sets 

In most cases the discovered SNPs easily reach a
number so large that only a small portion of them,
designated a core SNP set, is selected and used in
genotyping to meet constraints of available budget, desired
platform, application, and other factors [3,11,32-34].
Core sets of different sizes (e.g. 384, 1536, or
other numbers) are either required by certain SNP
genotyping platforms or optimized for particular applications
[35-38]. It may be a daunting job, but it is necessary to
establish workable criteria for selecting a core set of
any given size. Based on this complete mining
and validation process, several attributes of SNPs can be
very useful for refining such core sets.
SNP oligo alignment uniqueness,
identity percentage, and distribution in the reference
genome, co-existence across different genomes, along
with SNP types (nsSNP vs. sSNP, and transition vs.
transversion vs. indel) and numbers per gene, should be
the main criteria for selection of citrus core SNP sets. As
pointed out above, some extra haplotypes might result from
paralogs in different genome regions. In that case,
the resulting SNPs would not be allelic or useful.
Whether these are mostly the SNPs that had multiple
scaffold hits and alignments remains unclear pending
further investigation. SNPs from either circumstance
should be excluded or at least deprioritized for
use in genotyping. Selection of SNPs for genotyping
could be difficult when different attributes of SNPs and
genotyping platforms are considered. A tool based on
these attributes is being developed to automate the
selection of core SNP sets for targeted
applications/platforms [35,36] and to allow geneticists and
molecular breeders to select and use
core SNPs of interest from among the thousands
discovered [37,38]. All the SNPs (Additional file 2) identified
in this work are being added to a citrus genome database
(citrusgenomedb.org). Very recently, after this study,
another draft genome of sweet orange was reported,
yielding 1.06 million genome-wide SNPs, about 3.6 SNPs/kb,
which could be an additional valuable resource for SNP
applications [39].
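
To make the selection criteria concrete, the following is a minimal
sketch of how some of them could be combined into a filter. It is not
the tool under development by the authors. The record fields, the
identity threshold of 24 matches, and the per-contig cap are all
assumptions chosen for illustration, and the SNP and contig names are
made up.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpCandidate:
    """Hypothetical record for one mined SNP and its BLAST summary."""
    snp_id: str
    contig: str        # source contig (used as a proxy for the gene)
    hits: int          # number of scaffolds hit by the 25-mer oligo
    alignments: int    # total number of alignments for the oligo
    matches: int       # identical bases in the alignment, out of 25
    n_haplotypes: int  # haplotypes in the source contig


def select_core_set(candidates, size, max_per_contig=2):
    """Pick up to `size` SNPs, keeping the caller's priority order.

    Assumed rules: one unique hit/alignment, at least 24/25 identity,
    at most 2 haplotypes, and a cap on SNPs taken from one contig.
    """
    per_contig = Counter()
    core = []
    for snp in candidates:
        if len(core) >= size:
            break
        unique = snp.hits == 1 and snp.alignments == 1
        if not unique or snp.matches < 24 or snp.n_haplotypes > 2:
            continue  # excluded: ambiguous, low identity, or extra haplotypes
        if per_contig[snp.contig] >= max_per_contig:
            continue  # limit the number of SNPs per contig
        per_contig[snp.contig] += 1
        core.append(snp)
    return core


# Made-up candidates, for illustration only.
demo = [
    SnpCandidate("snp1", "ctgA", 1, 1, 25, 2),
    SnpCandidate("snp2", "ctgA", 1, 1, 24, 2),
    SnpCandidate("snp3", "ctgA", 1, 1, 25, 2),  # over the per-contig cap
    SnpCandidate("snp4", "ctgB", 2, 3, 25, 2),  # multiple scaffold hits
    SnpCandidate("snp5", "ctgC", 1, 1, 25, 3),  # more than 2 haplotypes
    SnpCandidate("snp6", "ctgD", 1, 1, 22, 2),  # low identity
    SnpCandidate("snp7", "ctgE", 1, 1, 24, 2),
]
core = select_core_set(demo, size=3)
print([snp.snp_id for snp in core])
```

A hard filter is the simplest choice here. A scoring scheme that
deprioritizes rather than excludes ambiguous SNPs, or that balances
SNPs across scaffolds and SNP types, would follow the same pattern.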

Conclusions 

High-quality SNPs in public ESTs from different citrus
genotypes were detected by the QualitySNP pipeline and
compared to estimate the heterozygosity of each genome.
All the short SNP oligo sequences were also
aligned with the Clementine citrus genome to determine
their distribution and uniqueness in the genome and for
in silico validation. Selected SNPs were also validated by
SNaPshot and sequencing.

Additional files 



Additional file 1: Table S1. Summary of citrus EST SNPs. It includes
mining results from 27 individual varieties with their index number,
binomial name, common name, and abbreviation; 3 grouped ESTs - M12,
12 mandarins (2-13); L7, 7 limes/lemons (14-20); C27, all 27 citrus varieties
(1-27); and three summed/averaged results, SM12, SL7 and SC27, respectively
from the 12 individually mined mandarins, 7 limes/lemons, and all 27
varieties, which were used for comparison to M12, L7, and C27.

Additional file 2: Table S2. 25,417 25-mer sequences of SNPs and forward,
reverse, single base extension (SBE) primer, and SBE 5'-tail sequences
for 96 SNPs selected from sweet orange.

Additional file 3: Figure S1. SNP distribution on the Clementine
reference genome Scaffold_2. Each interval of the x-axis represented
500 kb of the scaffold, and the y-axis represented the number of SNPs in
each 500 kb on the scaffold. SO - sweet orange (A); TO - trifoliate orange
(B); CM - Clementine mandarin (C); "_a" - counts of all alignments generated
by all SNPs; "_1" - counts of SNPs of only 1 unique hit/alignment in
the genome. Differences between the "_a" and "_1" numbers were observed
in several regions of each cultivar.

Additional file 4: Figure S2. SNaPshot chromatograph of a SNP
validated by the assay, generated by GeneMarker (SoftGenetics, State
College, PA). The y-axis represents the intensity of, and the x-axis the
approximate length of, the fluorescently-labeled SBE products ending with A and G.